General Druid refactors #16708

adarshsanjeev · 2024-07-09T03:51:30Z

Some general refactors across Druid.

Switch to DruidExceptions
Add javadocs
Fix a bug in IntArrayColumns
Add a class for LongArrayColumns
Remove wireTransferable since it would never be called
Refactor DictionaryWriter to return the index written as a return value from write.

This PR has:

processing/src/main/java/org/apache/druid/segment/data/FrontCodedIntArrayIndexedWriter.java

processing/src/main/java/org/apache/druid/segment/data/FixedIndexedWriter.java

...ssing/src/main/java/org/apache/druid/segment/loading/MMappedQueryableSegmentizerFactory.java

processing/src/main/java/org/apache/druid/segment/data/VByte.java

clintropolis · 2024-07-09T04:48:17Z

processing/src/main/java/org/apache/druid/query/rowsandcols/concrete/ColumnHolderRACColumn.java

@@ -91,7 +91,7 @@ public int numRows()
      public boolean isNull(int rowNum)
      {
        offset.set(rowNum);
-        return valueSelector.isNull();
+        return valueSelector.getObject() == null;


does this have a different contract than BaseNullableColumnValueSelector? Its isNull method means that getLong/getFloat/getDouble will return a null value, not necessarily that getObject is null. E.g. sometimes non-numeric selectors will implement these primitive getters, but not necessarily be null rows.

Looking at some of the things that implement it, it doesn't appear to be different, e.g. ObjectColumnAccessorBase has an implementation of getLong that checks if the instance is a String, and if so, tries to parse it. This means that getObject will not return null, but getLong isn't valid, so there is no way to determine that the 0 it returns is actually a null because we couldn't parse the string 'hello' into a long.

The contract here is supposed to be abandoning the "numerical null" thing that the ValueSelectors follow. It is explicitly "is it null" rather than "will it be null if cast to a primitive type"

how do you know if it is safe to call getLong if that is the contract?

Or maybe we should eliminate implicit casting in the numeric getter methods to make it so that they only work if the column type really is a number?

From the javadoc for ColumnAccessor (Get whether the value of a row is null), the updated behaviour would be correct.

i think we need to revisit this contract, since it doesn't really feel compatible to coexist with implicit casting on the getLong/getFloat etc methods. If you have a string column but want to read numbers from it but not all of them can be parsed into numbers, you can't reliably use any of the primitive getters, because isNull would say false, but getLong would not be parseable into a number and return 0. See ObjectColumnAccessorBase.

This means you can't actually use these methods if you want to try to get numbers from an object column, so they should probably instead throw exceptions instead of silently giving you zeros that should have been handled as nulls.

We can fix this by just forbidding this sort of implicit casting, I'm not sure it does us any favors in terms of making code clear to follow, while having explicit casting wrapper gizmos would make it obvious when something is turning a string into a number.

not that it has to be done now, but it does seem problematic and is why ColumnValueSelector.isNull has the strange contract it does, because you kind of need it if you allow implicit casting.

The semantics as defined here is that a non-parseable String, when read from getLong is 0 instead of null. If something wants that to be null, it needs to do its own parsing/handling. It is entirely possible to, if you have a String object, use the getObject() form and handle it yourself.

We could make this logic be generically accessible by having a LongAccessor interface and ask the column to become one of those when it wants to force-cast to Long. What I'm trying to say is that I understand the semantics you are looking for and I think that it's totally valid, I don't think it has to be delivered from this set of getters which just exist to provide primitive values in the simplest possible way. That said, perhaps if we had the type-specific accessors we would end up not needing these generic get methods and then life would just be better. Either way, it's something that can be adjusted in another PR.

processing/src/main/java/org/apache/druid/query/rowsandcols/column/LongArrayColumn.java

clintropolis · 2024-07-09T04:51:46Z

processing/src/main/java/org/apache/druid/query/rowsandcols/column/LongArrayColumn.java

+        return FindResult.found(startIndex, end);
+      }
+
+      int i = Arrays.binarySearch(vals, startIndex, endIndex, val);


I know this isn't really new since the others do this too, but is there any guarantee that vals is sorted?

These methods are part of the BinarySearchableAccessor, which is fundamentally just saying "if you want to binary search and you know it's safe, use these methods". Instead of making the accessor itself implement this, it probably would've been better to make it another interface off of the Column. That's my bad really.

Fwiw, the whole BinarySearchableAccessor is actually only used by an experimental join implementation that exists to exercise the operator stuff. Your commentary about the oddities are totally valid and belong against that. It's not really a solidified part of the code base though, so we should perhaps add some javadocs to the interface itself, talk a bit about why it's awkward and when/if we get back to it, we can improve it.

The worst-case time complexity of findLong is O(n) if all the values from startIndex to endIndex are the same. It can be improved by doing two binary searches to find the upper and the lower bound. The current implementation is the same in other places. Is it worth switching to the proposed one instead?
If the array always has mostly distinct values, then the current implementation would work better on average.

No, none of the "find" stuff is actually being used for anything meaningful. Ignore it.

I'm not sure if it's more likely that we have distinct values here, but currently, this is in line with the other type columns. If we make a call on using multiple searches instead, that should be consistent with the other columns here.

@adarshsanjeev Lets java doc the oddity for now and move forward? We can take up this work as part of a follow up refactor ?

I have added this to the javadoc in BinarySearchableAccessor

clintropolis · 2024-07-09T04:53:27Z

processing/src/main/java/org/apache/druid/query/rowsandcols/column/LongArrayColumn.java

+    @Override
+    public FindResult findComplex(int startIndex, int endIndex, Object val)
+    {
+      return findLong(startIndex, endIndex, Numbers.tryParseLong(val, 0));


i know this isn't really new, but why doesn't this need to distinguish between the case where val is parseable into a number and not?

The meta answer to your question is because it's not an interface that's used for anything actually meaningful yet.

The actual answer to your question is that the one place that is trying to use this is first checking for null and then calling findNull It then does a switch on the searcher's type and calls the relevant method for that. The type of the LongArrayColumn is LONG so it won't actually call findComplex() and will call findLong() instead. This is fundamentally putting the responsibility on the call site to "do the right thing". Is that the correct behavior or not? I dunno, but it's an experimental interface that you would have to be incredibly intrepid to actually hit, so it's a detail that perhaps doesn't matter.

If your suggestion is to switch this to a NotYetImplemented exception, I'd probably agree with that suggestion.

Changed the numberColumns to throw an exception instead of using a 0 if the string is unparsable for findComplex and findString

imply-cheddar

Generally LGTM. Make sure you have green checks and that clint is satisfied before merge.

imply-cheddar · 2024-07-09T05:25:33Z

processing/src/main/java/org/apache/druid/query/rowsandcols/column/LongArrayColumn.java

+        return FindResult.found(startIndex, end);
+      }
+
+      int i = Arrays.binarySearch(vals, startIndex, endIndex, val);


These methods are part of the BinarySearchableAccessor, which is fundamentally just saying "if you want to binary search and you know it's safe, use these methods". Instead of making the accessor itself implement this, it probably would've been better to make it another interface off of the Column. That's my bad really.

processing/src/main/java/org/apache/druid/query/rowsandcols/column/LongArrayColumn.java

imply-cheddar · 2024-07-09T05:28:08Z

processing/src/main/java/org/apache/druid/query/rowsandcols/column/LongArrayColumn.java

+        return FindResult.found(startIndex, end);
+      }
+
+      int i = Arrays.binarySearch(vals, startIndex, endIndex, val);


Fwiw, the whole BinarySearchableAccessor is actually only used by an experimental join implementation that exists to exercise the operator stuff. Your commentary about the oddities are totally valid and belong against that. It's not really a solidified part of the code base though, so we should perhaps add some javadocs to the interface itself, talk a bit about why it's awkward and when/if we get back to it, we can improve it.

imply-cheddar · 2024-07-09T05:30:16Z

processing/src/main/java/org/apache/druid/query/rowsandcols/concrete/ColumnHolderRACColumn.java

@@ -91,7 +91,7 @@ public int numRows()
      public boolean isNull(int rowNum)
      {
        offset.set(rowNum);
-        return valueSelector.isNull();
+        return valueSelector.getObject() == null;


The contract here is supposed to be abandoning the "numerical null" thing that the ValueSelectors follow. It is explicitly "is it null" rather than "will it be null if cast to a primitive type"

processing/src/main/java/org/apache/druid/segment/data/FixedIndexedWriter.java

processing/src/main/java/org/apache/druid/query/rowsandcols/column/LongArrayColumn.java

LakshSingla · 2024-07-11T05:01:45Z

processing/src/main/java/org/apache/druid/query/rowsandcols/column/LongArrayColumn.java

+        return FindResult.found(startIndex, end);
+      }
+
+      int i = Arrays.binarySearch(vals, startIndex, endIndex, val);


The worst-case time complexity of findLong is O(n) if all the values from startIndex to endIndex are the same. It can be improved by doing two binary searches to find the upper and the lower bound. The current implementation is the same in other places. Is it worth switching to the proposed one instead?
If the array always has mostly distinct values, then the current implementation would work better on average.

kgyrtkirk · 2024-07-19T15:19:20Z

processing/src/main/java/org/apache/druid/error/NotYetImplemented.java

+ */
+public class NotYetImplemented extends DruidException.Failure
+{
+  public static DruidException ex(String msg, Object... args)


I think it would be better to not have factory methods which omit the cause exception
I've seen places where there was a catch block - but the newly created exception did not had the cause filled in most likely by mistake - but as a result the stacktrace was clipped.

I think forcing the api user to write out null for cause makes this mistake less probable to do the same

kgyrtkirk · 2024-07-22T08:32:49Z

processing/src/main/java/org/apache/druid/error/DruidException.java

+        ++firstNonDruidExceptionIndex;
+      }
+      if (firstNonDruidExceptionIndex < stackTrace.length) {
+        retVal.setStackTrace(Arrays.copyOfRange(stackTrace, firstNonDruidExceptionIndex, stackTrace.length));


this conditional looks odd... its not wrong; but it will only skip the set if it went over the full stacktrace; but when its 0 it still does it...

I guess the intention here is to clip the exception construction process?
could you add a test for this?
with a case which checks what happens if a DruidException is set as a cause for another DruidException

Changed this to if (firstNonDruidExceptionIndex > 0)

If firstNonDruidExceptionIndex == stackTrace.length, we should copy an empty array, which is what would happen, and if the the exception is 0, we should not need a copy.

processing/src/main/java/org/apache/druid/error/NotYetImplemented.java

…ted.java

cheddar · 2024-08-06T16:46:31Z

Just looking at the comments, it looks like things have generally been addressed. Clint has an outstanding comment about the semantics of the ColumnAccessor interface, which are valid but also not the central theme of this PR. As such, I don't see a reason to delay the merge of these changes, so going to push the merge button.

adarshsanjeev added 3 commits July 8, 2024 15:57

Minor refactors

a8dcf15

Refactor write to return index of value written

57e9b02

Refactor compressionFactory

cb79da4

github-actions bot added the Area - Segment Format and Ser/De label Jul 9, 2024

Fix checkstyle

615bf32

github-advanced-security bot found potential problems Jul 9, 2024

View reviewed changes

processing/src/main/java/org/apache/druid/segment/data/FrontCodedIntArrayIndexedWriter.java Dismissed Show dismissed Hide dismissed

clintropolis reviewed Jul 9, 2024

View reviewed changes

imply-cheddar approved these changes Jul 9, 2024

View reviewed changes

adarshsanjeev mentioned this pull request Jul 9, 2024

Add a segmentMorphFactory to MSQ Datasource Destination #16692

Closed

10 tasks

adarshsanjeev added 2 commits July 10, 2024 23:07

Remove some unused code

d09706c

Fix broken test

ee8766b

LakshSingla reviewed Jul 11, 2024

View reviewed changes

adarshsanjeev added 7 commits July 12, 2024 22:48

Add tests

70b13da

Additional changes

c80eaf7

Switch to throwing an exception for invalid find calls

396c71f

Remove comment

ceafa09

Merge remote-tracking branch 'origin/master' into misc-refactors

5f014e3

Merge remote-tracking branch 'origin/master' into misc-refactors

927f7b9

Minor changes

33732ac

kgyrtkirk reviewed Jul 22, 2024

View reviewed changes

asdf2014 reviewed Jul 22, 2024

View reviewed changes

processing/src/main/java/org/apache/druid/error/NotYetImplemented.java Outdated Show resolved Hide resolved

asdf2014 and others added 4 commits July 22, 2024 17:42

Update processing/src/main/java/org/apache/druid/error/NotYetImplemen…

496376a

…ted.java

Address review comments

ab36689

Revert accidental change

f58ffba

Refactor

f9befcd

cheddar merged commit 7390684 into apache:master Aug 6, 2024
88 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

General Druid refactors #16708

General Druid refactors #16708

adarshsanjeev commented Jul 9, 2024

clintropolis Jul 9, 2024

imply-cheddar Jul 9, 2024

clintropolis Jul 9, 2024

clintropolis Jul 9, 2024

adarshsanjeev Jul 11, 2024

clintropolis Jul 11, 2024

clintropolis Jul 11, 2024 •

edited

Loading

imply-cheddar Aug 6, 2024

clintropolis Jul 9, 2024

imply-cheddar Jul 9, 2024

imply-cheddar Jul 9, 2024

LakshSingla Jul 11, 2024 •

edited

Loading

imply-cheddar Jul 12, 2024

adarshsanjeev Jul 12, 2024

cryptoe Jul 22, 2024 •

edited

Loading

adarshsanjeev Jul 22, 2024

clintropolis Jul 9, 2024

imply-cheddar Jul 12, 2024

adarshsanjeev Jul 15, 2024

imply-cheddar left a comment

imply-cheddar Jul 9, 2024

imply-cheddar Jul 9, 2024

imply-cheddar Jul 9, 2024

LakshSingla Jul 11, 2024 •

edited

Loading

kgyrtkirk Jul 19, 2024

adarshsanjeev Jul 24, 2024

kgyrtkirk Jul 22, 2024

adarshsanjeev Jul 24, 2024

cheddar commented Aug 6, 2024

General Druid refactors #16708

General Druid refactors #16708

Conversation

adarshsanjeev commented Jul 9, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

clintropolis Jul 11, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

LakshSingla Jul 11, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cryptoe Jul 22, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

imply-cheddar left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

LakshSingla Jul 11, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cheddar commented Aug 6, 2024

clintropolis Jul 11, 2024 •

edited

Loading

LakshSingla Jul 11, 2024 •

edited

Loading

cryptoe Jul 22, 2024 •

edited

Loading

LakshSingla Jul 11, 2024 •

edited

Loading